[test] fix: shrink CP packed SFT functional model by yaoyu-33 · Pull Request #4383 · NVIDIA-NeMo/Megatron-Bridge

yaoyu-33 · 2026-06-16T02:03:15Z

Summary

The CP packed SFT functional test now creates a pretrain checkpoint and then loads it for SFT in the same pytest process. Using the full Llama 3.2 1B recipe shape makes the second DDP buffer allocation run near the H100 memory limit and can OOM before the SFT path starts.

This change keeps the test coverage and shrinks only the test model shape, applying the same shape to both the pretrain and SFT providers so the checkpoint stays compatible, while still exercising:

context parallel size 2
packed SQuAD SFT data
pretrain checkpoint creation
pretrained checkpoint load into SFT

It also runs garbage collection, clears the CUDA cache, and synchronizes at a barrier between the two phases.
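
A minimal sketch of what such a cleanup step between the pretrain and SFT phases could look like, assuming PyTorch with an optional `torch.distributed` process group. The function name `release_memory_between_phases` is hypothetical and is not taken from the test file. The guarded import is only there so the sketch also runs on a host without PyTorch.

```python
import gc

try:
    import torch
    import torch.distributed as dist
except ImportError:  # assumption: let the sketch run on hosts without PyTorch
    torch = None
    dist = None


def release_memory_between_phases() -> int:
    """Hypothetical helper: free Python and CUDA memory, then sync ranks.

    Returns the number of unreachable objects found by the garbage collector.
    """
    # Drop unreachable Python objects so the tensors they own can be freed.
    collected = gc.collect()

    if torch is not None and torch.cuda.is_available():
        # Hand cached but unused blocks from the caching allocator back to the driver.
        torch.cuda.empty_cache()

    if dist is not None and dist.is_available() and dist.is_initialized():
        # Wait until every rank has finished cleanup before the next phase allocates.
        dist.barrier()

    return collected
```

`torch.cuda.empty_cache()` only releases blocks that no live tensor still references, so any references to the pretrain model and optimizer would need to be dropped before a helper like this is called.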

Verification

python3 -m py_compile tests/functional_tests/test_groups/training/test_seqpacking_cp_example.py
git diff --check
uvx ruff check tests/functional_tests/test_groups/training/test_seqpacking_cp_example.py
uvx ruff format --check tests/functional_tests/test_groups/training/test_seqpacking_cp_example.py

Note: uv run ruff ... is blocked on this local host by the pinned nvidia-resiliency-ext wheel platform tag, so I used isolated uvx ruff for the file-scoped lint checks. GPU validation should come from CI.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

copy-pr-bot · 2026-06-16T02:03:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 · 2026-06-16T02:03:39Z

/ok to test b17315a

claude · 2026-06-16T02:04:59Z

Review

LGTM. Clean fix that correctly addresses the OOM issue by shrinking the test model shape while preserving the coverage paths (CP + packing + checkpoint load).

Good details:

_set_existing_attr follows the project convention of guarding against phantom setattr on config dataclasses
GC + cache cleanup + barrier between pretrain and finetune phases is the right way to reclaim GPU memory in a single-process two-phase test
Model dimensions are internally consistent (kv_channels = hidden_size / num_attention_heads = 64)
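
For illustration, a guarded setter of this kind and the dimension check could be sketched as follows. This is not the PR's code: `TinyModelConfig`, `set_existing_attr`, and the numeric values are hypothetical stand-ins, and raising on a missing attribute is an assumption (the real `_set_existing_attr` may handle that case differently).

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TinyModelConfig:
    """Hypothetical stand-in for a model provider config dataclass."""

    hidden_size: int = 2048
    num_attention_heads: int = 32
    kv_channels: int = 64


def set_existing_attr(obj: Any, name: str, value: Any) -> None:
    """Set ``name`` on ``obj`` only if the attribute already exists.

    A plain setattr would silently create a new ("phantom") attribute
    when the name is misspelled; this sketch raises instead (assumption).
    """
    if not hasattr(obj, name):
        raise AttributeError(f"{type(obj).__name__} has no attribute {name!r}")
    setattr(obj, name, value)


# Illustrative values only; the dimensions used in the PR are not shown here.
SHRUNK_SHAPE = {"hidden_size": 512, "num_attention_heads": 8, "kv_channels": 64}

pretrain_cfg = TinyModelConfig()
sft_cfg = TinyModelConfig()

# Apply the same overrides to both configs so checkpoint shapes stay compatible.
for cfg in (pretrain_cfg, sft_cfg):
    for name, value in SHRUNK_SHAPE.items():
        set_existing_attr(cfg, name, value)
    # Consistency check: kv_channels = hidden_size / num_attention_heads.
    assert cfg.kv_channels == cfg.hidden_size // cfg.num_attention_heads
```

Applying one shared override dictionary to both configs is one way to keep the pretrain checkpoint loadable by the SFT phase, since both models end up with identical parameter shapes.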

Suggested test cases

No perf tests impacted.

[test] fix: shrink CP packed SFT functional model

b17315a

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 added the area:training, bug, and needs-review labels Jun 16, 2026

copy-pr-bot Bot temporarily deployed to public June 16, 2026 02:04 Inactive

copy-pr-bot Bot temporarily deployed to test June 16, 2026 02:04 Inactive

copy-pr-bot Bot temporarily deployed to public June 16, 2026 02:35 Inactive

copy-pr-bot Bot temporarily deployed to public June 16, 2026 02:36 Inactive

copy-pr-bot Bot temporarily deployed to public June 16, 2026 02:56 Inactive

yaoyu-33 merged commit 839da68 into main Jun 16, 2026
105 checks passed

yaoyu-33 deleted the yuya/fix-seqpacking-cp-oom branch June 16, 2026 04:24
